Abstract - We describe a method of automatic feedback provision for students learning programming and computational methods in Python. We have implemented, used and refined this system since 2009 for growing student numbers, and summarise the design and experience of using it. The core idea is to use a unit testing framework: the teacher creates a set of unit tests, and the student code is tested by running these tests. With our implementation, students typically submit work for assessment, and receive feedback by email within a few minutes after submission. The choice of tests and the reporting back to the student are designed to optimise the educational value for the students. The system very significantly reduces the staff time required to establish whether a student's solution is correct, and shifts the emphasis of computing laboratory student contact time from assessing correctness to providing guidance. The self-paced nature of the automatic feedback provision supports a student-centred learning approach. Students can re-submit their work repeatedly and iteratively improve their solution, and enjoy using the system. We include an evaluation of the system and data from using it in a class of 425 students.

Index Terms - Automatic assessment tools, automatic feedback provision, programming education, Python, self-assessment technology

1 Introduction 
1.1 Context 

Programming skills are key for software engineering and computer science, but are increasingly relevant for computational science outside computer science as well, for example in engineering, natural and social science, mathematics and economics. The learning and teaching of programming is a critical part of a computer science degree and is becoming more and more important in taught and research degrees of other disciplines.

This paper focuses on an automatic submission, testing and feedback provision system that has been designed, implemented, used and further developed at the University of Southampton since 2009 for undergraduate and postgraduate programming courses. While in this setting the primary target group of students were engineers, the same system could be used to benefit the learning of computer science students.

1.2 Effective teaching of programming skills

One of the underpinning skills for computer science, software engineering and computational science is programming. A thorough treatment of the existing literature on teaching introductory programming was given by Pears et al. [1], while a previous review focused mainly on novice programming and topics related to novice teaching and learning [2]. Here, we motivate the use of an automatic assessment and feedback system in the context of teaching introductory programming skills.

Programming is a creative task: given the constraints of the programming language to be used, it is the choice of the programmer what data structure to use, what control flow to implement, what programming paradigm to use, how to name variables and functions, how to document the code, and how to structure the code that solves the problem into smaller units (which potentially could be re-used). Experienced programmers value this freedom and gain satisfaction from developing a 'beautiful' piece of code or finding an 'elegant' solution. For beginners (and teachers) the variety of 'correct' solutions can be a challenge.

Given a particular problem (or student exercise), 
for example to compute the solution of an ordinary 
differential equation, there are a number of criteria 
that can be used to assess the computer program 
that solves the problem: 

1) correctness: does the code produce the correct answer? (For numerical problems, this requires some care: for the example of the differential equation, we would expect for a well-behaved differential equation that the numerical solution converges towards the exact solution as the step-width is reduced towards zero.)

2) execution time performance: how fast is the solution computed?

3) memory consumption: how much RAM is required to compute the solution?

4) robustness: how robust is the implementation with respect to missing/incorrect input values, etc?

5) elegance, readability, documentation: how long is the code? Is it easy for others to understand? Is it easy to extend? Is it well documented, or is the choice of algorithm, data structures and naming of objects sufficient to document what it does?
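The caveat in criterion 1) can be made concrete with a small sketch. Assuming, purely for illustration, the test problem dy/dt = -y with y(0) = 1 and a hypothetical student function `euler` (neither is taken from the course material), a correctness test for a numerical method can check that the error shrinks as the step width is reduced, rather than demanding agreement with the exact value:

```python
import math


def euler(f, y0, t_end, n_steps):
    """Hypothetical student solution: explicit Euler for dy/dt = f(t, y)."""
    h = t_end / n_steps
    t, y = 0.0, y0
    for _ in range(n_steps):
        y += h * f(t, y)
        t += h
    return y


def convergence_errors(step_counts):
    """Return |numerical - exact| at t = 1 for dy/dt = -y, y(0) = 1."""
    exact = math.exp(-1.0)
    return [abs(euler(lambda t, y: -y, 1.0, 1.0, n) - exact)
            for n in step_counts]


# Each halving of the step width should reduce the error.
errors = convergence_errors([10, 20, 40, 80])
converges = all(later < earlier
                for earlier, later in zip(errors, errors[1:]))
```

A test of this kind accepts any implementation that converges, which matters because many different step-width choices and code structures are 'correct'.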

The first aspect - correctness - is probably most important: it is better to have a slow piece of code that produces the correct answer, than to have one that is very fast but produces a wrong answer. When teaching and providing feedback, in particular to beginners, one tends to focus on correctness of the solution. However, the other criteria 2) to 5) are also important.

We demonstrate in this paper that the assessment of criteria 1) to 4) can be automated in day-to-day teaching of large groups of students. While the higher-level aspects such as elegance, readability and documentation of item 5) do require manual inspection of the code by an experienced programmer, we find that the teaching of the high-level aspects benefits significantly from automatic feedback, as all the contact time with experienced staff can be dedicated to those points, and no time is required to check criteria 1) to 4).

1.3 Automatic feedback provision and assessment

Over the past two decades interest has been rapidly growing in utilising new technologies to enhance the learning and feedback provision processes in higher education. In 1997, Price and Petre considered the importance of feedback from an instructor to students learning programming, especially looking into how electronic assignment handling can contribute to Internet-based teaching of programming [3]. Their study compares feedback given manually by several instructors to cohorts of conventional and Internet learning students, only a small fraction of which involved running the students' submissions. For the functional programming language Scheme, Saikkonen et al. described a system that assesses programming exercises with the possibility to analyse individual procedures and metrics such as run time [4]. A feedback system called "submit" for code in Java was introduced in 2003, which worked by allowing users to upload code, which would be compiled and (if the compilation was successful) run, with the output displayed for comparison with model output provided by the lecturer; the lecturer would manually grade the work later, and the system would also display this information [5]. Recognising the popularity of test-driven development and adopting that approach in programming courses, Stephen Edwards implemented a system, Web-CAT, that would assess both the tests and the code written by students [6]. Shortly thereafter, another group produced a tool for automatically assessing the style of C++ programs [7], which students were encouraged to use, and which was also used by instructors when manually assessing assignments; it was found that the students started to follow many important style guidelines once the tool was made available.

By 2005 there was sufficient interest in the field of automatic assessment systems that multiple reviews were published [8], [9], highlighting the emergence of evidence that automatic assessment can lead to increased student performance [10], [11]. Another benefit realised with automatic assessment systems is greater ease in detecting plagiarism, tools for the purpose having been included in several of the systems surveyed. Also reported in that year was CourseMarker [12], which can mark C++ and Java programs, and uses a Java client program to provide a graphical user interface to students.

A more recent review of automatic assessment systems [13], which highlighted newer developments, recommended that future systems devote more attention to security, and that future literature describe more completely how the systems work. A work from MIT CSAIL and Microsoft introduces a model in which the system - provided with a reference implementation of a solution, and an error model consisting of potential corrections to errors that students may make - automatically derives minimal corrections to students' incorrect solutions [14]. Another relatively recent development is the adoption of distributed, web-based training and assessment systems [15], as well as the increasingly popular "massive open online courses" or MOOCs [16]. A current innovation in the field is the nbgrader project [17], an open-source project that is designed for generating and grading assignments in IPython notebooks [18].

1.4 Outline 

In this work, we describe the motivation, design, implementation and effectiveness of an automatic feedback system for Python programming exercises used in undergraduate teaching for engineers. We aim to address the shortcomings of the current literature as outlined in the review [13] by detailing our implementation and security model, as well as providing sample testing scripts, inputs and outputs, and usage data from the deployed system. We combine the provision of the technical software engineering details of the testing and feedback system with motivation and explanation of its use in an educational setting, and data on student reception based on 6 years of experience of employing the system in multiple courses and countries.

In Sec. 2 we provide some historic context of how programming was taught prior to the introduction of the automatic testing system described here. Sec. 3 introduces the new method of feedback provision, initially from the perspective of the students - who are the users from a software engineering point of view - then providing more detail on design and implementation. Based on our use of the system over multiple years, we have composed results, statistics and a discussion of the system in Sec. 4, before we close with a summary in the final section.


2 Traditional delivery of programming education

In this section, we describe the learning and teaching methods used in the Engineering degree programmes at the University of Southampton before the automatic feedback system was introduced.

2.1 Programming languages used 

We taught languages such as C and MATLAB to students in Engineering as their first programming languages until 2004, when we introduced Python [19] into the curriculum. Over time, we have moved to teaching Python as a versatile language [20], [21] that is relatively easy to learn [22] and useful in a wide variety of applications [23], [24]. We teach C to advanced students in later years as a compiled and fast language.

2.2 Lectures 

Lectures that introduce a programming language to beginners are typically scheduled over a duration of 12 weeks, with two 45-minute lectures per week. This is combined with a scheduled computing laboratory (90 minutes) every week (Sec. 2.3), and an additional and optional weekly "help session" (Sec. 2.4).

The lectures introduce new material, demonstrate what one can do with new commands, and how to use programming elements or numerical methods. In nearly all lectures, new commands and features are used and demonstrated by the lecturer in live-coding of small programs, often with involvement of the students. The lectures are thus a mixture of traditional lectures and a tutorial-like component where the new material is applied to solve a problem, and - while only the lecturer has a keyboard which drives a computer with display output connected to a data projector - all students contribute, or are at least engaged, in the process of writing a piece of code.

2.3 Computing laboratories 

However, for the majority of students the actual learning takes place when they carry out programming exercises themselves.

To facilitate this, computer laboratory sessions (90 minutes every week) are arranged in which each student has one computer, and works at their own pace through a number of exercises. Teaching staff are available during the session, and we have found that about 1 demonstrator (teaching assistant) per 10 students is required for this set-up.

The lecturer and demonstrators (either academics or postgraduate students) fulfil three roles in these laboratory sessions:

(i) to provide help and advice when students have difficulties or queries while carrying out the self-paced exercises,

(ii) to establish whether a student's work is correct (i.e. does the student's computer program do what it is meant to do), and

(iii) to provide feedback to the student (in particular: what they should change for future programs they write).

Typically, prior to introducing the automatic testing system in 2009, the teaching assistants were spending 90% of their time on activity (ii), i.e. checking students' code for correctness, and only the remaining 10% of their time could be used on (i) and (iii), while the educational value is overwhelmingly in (i) and (iii).

In practical terms, the assessment and feedback provision was done in pairs consisting of one demonstrator and one student looking through the student's files on the student's computer at some point during the subsequent computing laboratory session. The feedback and assessment was thus delivered one week after the students had completed the work.

2.4 Help session 

In the weekly voluntary help session, computers and teaching staff are available for students if they need support exceeding the normal provision, would like to discuss their solutions in more depth, or seek inspiration and tasks to study topics well beyond the expected material.

3 New method of automatic feedback provision
3.1 Overview 

In 2009, we introduced an automatic feedback provision system that checks each student's code for correctness and provides feedback to the student within a couple of minutes of having completed the work. This takes a huge load off the demonstrators, who consequently can spend most of their time helping students to do the exercises (item (i) in Sec. 2.3) and providing additional feedback on completed and assessed solutions (item (iii) in Sec. 2.3). Due to the introduction of the system, the learning process can be supported considerably more effectively, and we could reduce the number of demonstrators from 1 per 10 students, as we had pre-2009, to 1 demonstrator per 20 students, and still improve the learning experience and depth of material covered. There was no change to the scheduled learning activities, i.e. the weekly lectures (Sec. 2.2), computing laboratory sessions (Sec. 2.3), and help sessions (Sec. 2.4) remain.

In Sec. 3.2 "Student's perspective" we show a typical example of a very simple exercise, along with correct and incorrect solutions, and the feedback that those solutions give rise to. Later sections detail the system design and work flow (Sec. 3.3) and in particular the implementation of the student code testing (Sec. 3.4), with reference to this example exercise.


3.2 Student’s perspective 

Once a student completes a programming exercise in the computing laboratory session, they send an email to a dedicated email account that has been created for the teaching course, and attach the file containing the code they have written. The subject line is used by the student to identify the exercise; for example "Lab 4" would identify the 4th practical session. The system receives the student's email, and the next thing that the student sees is an automatically generated email confirmation of the submission (or, should the submission not be valid, an error message is emailed instead, explaining why the submission was invalid; invalid submissions can occur, for example, if emails are sent from email accounts that are not authorised to submit code). At this stage, the student's code is enqueued for testing, and after a short interval, the student receives another email containing their assessment results and feedback. Where problems are detected, this email also includes details of what the problems were. Typically, the student will receive feedback in their inbox within two to three minutes of sending their email.

We shall use the following example exercise, which is typical of one that we might use in an introductory Python laboratory, as the basis for our case study:

Please define the following functions in the file training1.py and make sure they behave as expected. You also should document them suitably.

1) A function distance(a, b) that returns the distance between numbers a and b.

2) A function geometric_mean(a, b) that returns the geometric mean of two numbers, i.e. the edge length that a square would have so that its area equals that of a rectangle with sides a and b.

3) A function pyramid_volume(A, h) that computes and returns the volume of a pyramid with base area A and height h.


We show a correct solution to question 3 of this example exercise in Listing 1. If a student who is enrolled on the appropriate course submits this, along with correct responses to the other questions, by email to the system, they will receive feedback as shown in Listing 2.


def pyramid_volume(A, h):
    """Calculate and return the volume of a pyramid
    with base area A and height h.
    """
    return (1. / 3.) * A * h

Listing 1: A correct solution to question 3 of the example exercise


Dear Neil O'Brien,

Testing of your submitted code has been completed:

Overview

test_distance : passed -> 100% ; with weight 1
test_geometric_mean : passed -> 100% ; with weight 1
test_pyramid_volume : passed -> 100% ; with weight 1

Total mark for this assignment: 3 / 3 = 100%.
(Points computed as 1 + 1 + 1 = 3)

This message has been generated automatically. Should
you feel that you observe a malfunction of the system,
or if you wish to speak to a human, please contact the
course team (course-help@uni.email.address).

Listing 2: Email response to correct submission, additional line wrapping due to column width


If the student submits an incorrect solution, for example with a mistake in question 3 as shown in Listing 3, they will instead receive the feedback shown in Listing 4. Of course the students must learn to interpret this style of feedback in order to gain the maximum benefit, but this is in itself a useful skill, as we discuss more fully in Section 4.8.2, and comments from the testing code assist the students, as discussed in Section 3.4.5.


The submission in Listing 3 is incorrect because integer division is used rather than the required floating-point division. These exercises are based on Python 2, where the "/" operator represents integer division if both operands are of integer type, as is common in many programming languages (in Python 3, the "/" operator represents floating-point division even if both operands are of type integer).
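The behaviour can be reproduced in Python 3 with the floor-division operator `//`, which for integer operands behaves as `/` did in Python 2. The following sketch is for illustration only; the two function names are ours and are not part of the exercise:

```python
def pyramid_volume_py2_style(A, h):
    # For integer operands, '//' in Python 3 mimics Python 2's '/'
    return (A * h) // 3


def pyramid_volume_fixed(A, h):
    # A float factor forces floating-point arithmetic in both versions
    return (1. / 3.) * A * h


print(pyramid_volume_py2_style(1, 1))  # prints 0
print(pyramid_volume_fixed(1, 1))      # prints 0.3333333333333333
```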


def pyramid_volume(A, h):
    """Calculate and return the volume of a pyramid
    with base area A and height h.
    """
    return (A * h) / 3

Listing 3: An incorrect solution to question 3 of the example exercise, using integer division


Within the testing feedback in Listing 4, the student code is visible in the name space s, i.e. the function s.pyramid_volume is the function defined in Listing 3. The function correct_pyramid_volume is visible to the test system but students cannot see its implementation in the feedback they receive - this allows us to define tests that compute complicated values for comparison with those computed by the student's submission, without revealing the implementation of the reference computation to the students.

Dear Neil O'Brien,

Testing of your submitted code has been completed:

Overview

test_distance : passed -> 100% ; with weight 1
test_geometric_mean : passed -> 100% ; with weight 1
test_pyramid_volume : failed -> 0% ; with weight 1

Total mark for this assignment: 2 / 3 = 67%.
(Points computed as 1 + 1 + 0 = 2)

Test failure report

test_pyramid_volume

    def test_pyramid_volume():

        # if height h is zero, expect volume zero
        assert s.pyramid_volume(1.0, 0.0) == 0.

        # tolerance for floating point answers
        eps = 1e-14

        # if we have base area A=1, height h=1,
        # we expect a volume of 1/3.:
        assert abs(s.pyramid_volume(1., 1.) - 1./3.) < eps

        # another example
        h = 2.
        A = 4.
        assert abs(s.pyramid_volume(A, h) -
                   correct_pyramid_volume(A, h)) < eps

        # does this also work if arguments are integers?
>       assert abs(s.pyramid_volume(1, 1) - 1. / 3.) < eps
E       assert 0.3333333333333333 < 1e-14
E        + where 0.3333333333333333 = abs((0 - (1.0/3.0)))
E        + where 0 = <function pyramid_volume at 0x7f0ce1af4e60>(1, 1)
E        + where <function pyramid_volume at 0x7f0ce1af4e60> = s.pyramid_volume

Listing 4: Email response to incorrect solution

3.3 Design and Implementation

The design is based on three different processes that are started periodically (every minute) and communicate with each other via file system based task queues:

1) An incoming queue of student submissions, with initial validation and extraction of files and required tests to run (see high-level flow chart in Fig. 1a).

2) A queue of outgoing messages that need to be delivered to the users and administrators, which - in our email-based user interface - decouples the actual testing queue from the availability of the email servers (flow chart in Fig. 1b).

3) A queue of tests to be run, where the actual testing of the code takes place in a restricted environment (flow chart in Fig. 1c).

We describe how these work together in more detail in the following sections.

The system is implemented in Python, and primarily tests Python code (in Section 4 we discuss generalisation of the system to test code in other languages).
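A file system based task queue of the kind described above can be sketched in a few lines. This is not the implementation used in the deployed system: the directory layout, the JSON job format, and the names `enqueue` and `process_queue` are assumptions chosen for illustration, as is the example user name.

```python
import json
import os
import tempfile
import time
import uuid


def enqueue(queue_dir, job):
    """Add a job (a JSON-serialisable dict) to a directory-based queue."""
    os.makedirs(queue_dir, exist_ok=True)
    # Time-stamped names let the consumer process jobs roughly in arrival order
    name = "%020d-%s.json" % (time.time_ns(), uuid.uuid4().hex)
    tmp_path = os.path.join(queue_dir, name + ".tmp")
    with open(tmp_path, "w") as f:
        json.dump(job, f)
    # Renaming within one directory is atomic on POSIX file systems,
    # so a consumer never sees a half-written job file
    os.rename(tmp_path, os.path.join(queue_dir, name))


def process_queue(queue_dir, handler):
    """Handle all pending jobs in name order, removing each once handled."""
    for name in sorted(os.listdir(queue_dir)):
        if not name.endswith(".json"):
            continue  # skip files that are still being written
        path = os.path.join(queue_dir, name)
        with open(path) as f:
            job = json.load(f)
        handler(job)
        os.remove(path)


queue = os.path.join(tempfile.mkdtemp(), "testing")
enqueue(queue, {"user": "student1", "exercise": "lab 4",
                "files": ["training1.py"]})
handled = []
process_queue(queue, handled.append)
```

Because each process only reads and writes files, a producer and a consumer need not run at the same time, which is what allows the three processes to be started independently by cron.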

3.3.1 Email receipt and incoming queue process 

Each course that uses the automatic feedback provision system has a dedicated email account set up to receive submissions. At the University of Southampton, for a course with code ABC, the email address would be ABC@uni.email.address.

As the subject line, the student has to use a predefined string (such as lab 1), which is specified in the assignment instructions, so the testing system can identify which submission this is. The identity of the student is known through the email address of the sender.

The testing system accesses the email inbox every minute, and downloads all incoming mails from it using standard tools such as fetchmail or getmail, combined with cron. These incoming mails are then processed sequentially as summarised in the flow chart in Fig. 1a:

1) The email is copied, for backup purposes, to an archive of all incoming mail for the given course and year.

2) The email is checked for validity in the following respects:

a) the student must be known on this course (this is checked using a list of students enrolled on the course, provided by the student administration office); submissions from students who are not enrolled are logged for review by an administrator in case the student list was not correct or a student has transferred between courses;

b) the subject line of the email must relate to a valid exercise for the course;

c) all required files must be attached to the email, and these must be named as per the instructions for the exercise.

3) If the email is invalid (i.e., one or more of the above criteria are not met), an error report is created and enqueued in the outgoing email queue for delivery. The email explains why the submission is not valid, inviting the student to correct the problems and re-submit their work.

4) For a valid submission, the attachments of the incoming email containing the student's code are saved and

5) an item is placed into the testing queue, including the exercise that is to be tested, the student's user name, and names and paths of the files that were submitted.

6) For a valid submission, an email to the student is enqueued in the outgoing message queue that confirms receipt of the submission; the student can use this to evidence their submission and submission time, and it reassures the students that all required files were present, and that the submission has entered the system.

7) For both valid and invalid submissions, the email is removed from the incoming queue.

Fig. 1: Flow charts illustrating the work flow in each process: (a) email receipt and incoming queue process, (b) outgoing email queue process, (c) testing queue process. Processes are triggered every minute via a cron job entry, and do not start until their previous instance has completed.
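The validity checks in step 2 of the list above can be sketched with Python's standard `email` package. The enrolment set, the table of required files and the function name `validate` are hypothetical stand-ins for the course configuration; only the three checks themselves are taken from the text.

```python
from email.message import EmailMessage
from email.utils import parseaddr

# Hypothetical course configuration (illustrative values only)
ENROLLED = {"student1@uni.email.address"}
REQUIRED_FILES = {"lab 4": {"training1.py"}}


def validate(msg):
    """Return a list of problems; an empty list means a valid submission."""
    problems = []
    # a) the sender must be enrolled on the course
    sender = parseaddr(msg["From"])[1].lower()
    if sender not in ENROLLED:
        problems.append("sender is not enrolled on this course")
    # b) the subject line must name a valid exercise
    exercise = (msg["Subject"] or "").strip().lower()
    if exercise not in REQUIRED_FILES:
        problems.append("subject line does not name a valid exercise")
    else:
        # c) all required files must be attached under the expected names
        attached = {part.get_filename() for part in msg.iter_attachments()}
        missing = REQUIRED_FILES[exercise] - attached
        if missing:
            problems.append("missing files: " + ", ".join(sorted(missing)))
    return problems


msg = EmailMessage()
msg["From"] = "A Student <student1@uni.email.address>"
msg["Subject"] = "Lab 4"
msg.set_content("Please find my work attached.")
msg.add_attachment("def distance(a, b):\n    return abs(a - b)\n",
                   filename="training1.py")
problems = validate(msg)
```

Returning a list of problems rather than a single flag makes it straightforward to explain in the error email every reason why a submission was rejected.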


3.3.2 Outgoing messages 

The implementation of sending error messages and feedback reports to the students, and any other messages to administrators, is realised through a separate queue and process for outgoing messages (see Fig. 1b and the discussion of this design in Sec. 3.4.7). This process is also used for weekly emails informing students about their overall progress (Sec. 3.4.6).

We note in passing that all automatically generated messages invite the student to contact the course leader, other teaching staff or the administrator of the feedback provision system should they not understand the email or feel that the system has malfunctioned; help can be sought by email or in person during the timetabled teaching activities.


3.4 Design and implementation of student code testing


The testing queue shown in Fig. 1c processes submissions that have been enqueued by the incoming mail processing script. The task is to execute a number of predefined tests against the student code in a secure environment, using unit testing tools to establish correctness of the student submission. As we use Python for these courses in computation for science and engineering, we can plug into the testing capabilities that come with Python, and those that are provided by third party tools, such as nose [25] and pytest [26]. We have chosen the py.test tool because we have more experience with this system.

Here, we provide a brief overview of the testing process which is invoked every minute (unless an instance started earlier has not completed execution yet), with Sections 3.4.2 to 3.4.7 providing more details on the requirements and chosen design and implementation.
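The text does not say how an overlapping start is prevented. One common approach on POSIX systems, shown here as an assumption rather than as the mechanism actually deployed, is an advisory lock on a lock file; the name `run_exclusively` and the lock file path are hypothetical.

```python
import fcntl
import os
import tempfile


def run_exclusively(lock_path, task):
    """Run task() unless another instance holds the lock.

    Returns True if the task ran, False if it was skipped.
    """
    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking exclusive lock: fails at once if already held
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False  # earlier instance still running; retry next minute
        try:
            task()
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
        return True


lock = os.path.join(tempfile.mkdtemp(), "testing.lock")
nested = []
# While the outer task holds the lock, a second attempt is refused
ran = run_exclusively(
    lock, lambda: nested.append(run_exclusively(lock, lambda: None)))
```

The lock is released by the operating system if the process dies, so a crashed run does not block later runs.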

For each testing job found in the queue, the following steps are carried out (see the flow diagram in Fig. 1c):

1) The student files to be tested are copied to a sand-boxed location on the file system with limited access permissions (Sec. 3.4.1).

2) A dedicated local user with minimal privileges tries to import the code in a Python process to check for correct syntax.

3) If the import fails due to syntax errors, an error message is prepared for the user and injected into the outgoing message queue. (See also Sec. 4.4.1 for a discussion.) The job is removed from the testing queue and the process moves to the next item in the queue.

4) If the import succeeds, the tests are run on the submitted code in the restricted environment (Sec. 3.4.2 to 3.4.4).

5) Output files (that the student code may produce) and testing logs are archived, marks are extracted, and all data are stored in a database which may be used by the lecturer to discover the marks for each student, for each question and assignment.

6) A feedback message for the student is prepared and injected into the outgoing message queue containing the test results (Sec. 3.4.5). This provides the student with a score for each question in the assignment, and where mistakes were found, provides details of the particular incorrect behaviour that was discovered. Listing 4 shows an example of such feedback.

7) The test job is removed from the queue.
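Steps 2 and 3 of the list above can be sketched as follows. The sketch runs the import in a fresh Python process but, unlike the deployed system, does not switch to a dedicated low-privilege user; the function name `import_check` and the file names are hypothetical.

```python
import os
import subprocess
import sys
import tempfile


def import_check(directory, module_name):
    """Try to import module_name from directory in a fresh Python process.

    Returns (ok, error_text); error_text holds the traceback on failure.
    module_name comes from the exercise definition, not from the student.
    """
    result = subprocess.run(
        [sys.executable, "-c", "import " + module_name],
        cwd=directory, capture_output=True, text=True, timeout=30,
    )
    return result.returncode == 0, result.stderr


workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "good.py"), "w") as f:
    f.write("def distance(a, b):\n    return abs(a - b)\n")
with open(os.path.join(workdir, "bad.py"), "w") as f:
    f.write("def distance(a, b)\n    return abs(a - b)\n")  # missing colon

good_ok, _ = import_check(workdir, "good")
bad_ok, bad_error = import_check(workdir, "bad")
```

The captured traceback in `bad_error` is the kind of text that could be placed in the error message sent to the student.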


We discuss additional weekly feedback to students in Sec. 3.4.6 and the system's dependability in Sec. 3.4.7.


3.4.1 Security measures 

By the nature of the testing system, it contains student data (names, email addresses, and submissions), and it is incumbent upon the developers and administrators to take all reasonable measures to safeguard these data against unauthorised disclosure or modification. We also require the system to maintain a high availability and reliability. The risks that we need to guard against can largely be divided into two categories: (i) genuine mistakes made by students in their code, and (ii) attempts by students - or others who have somehow gained access to a student's email account - to intentionally access or change their own or other students' work, assigned marks, or other parts of the testing system.

Experience shows that some of the most common genuine mistakes made by students include cases such as unterminated loops, which would execute indefinitely. Due to the serialisation of the tests in our system, this problem, if left unchecked, would stop the system processing any further submissions until an administrator corrected it. However, we have applied a POSIX resource limit [27], [28] on CPU time to ensure that student work consuming more than a reasonable and fixed limit is terminated by the system. We catch any such terminations, and in this case we have adopted a policy of informing the student by email, and giving them the opportunity to re-submit an amended version of their work. We apply similar resource limits on both disk space consumption and virtual memory size, in order that loops which would output large amounts of data to stdout, stderr, or a file on disk, or which interminably append to a list or array resulting in its consumption of unreasonable amounts of memory, are also prevented from causing an undue impact on the testing machine's resources.
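In Python, POSIX resource limits are available through the standard `resource` module. The sketch below shows one way to apply them to a child process; the function name `run_limited` and the limit values are illustrative assumptions, and we do not claim that the deployed system sets its limits in exactly this way.

```python
import resource
import subprocess
import sys


def _cap(limit, value):
    """Lower a resource limit to value (never above the current hard limit)."""
    _, hard = resource.getrlimit(limit)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)
    resource.setrlimit(limit, (value, value))


def run_limited(code, cpu_seconds=1, memory_bytes=1024 ** 3,
                file_bytes=10 * 1024 ** 2):
    """Run Python source code in a child process under resource limits."""

    def set_limits():
        # Executed in the child process just before it starts (POSIX only)
        _cap(resource.RLIMIT_CPU, cpu_seconds)     # CPU time
        _cap(resource.RLIMIT_AS, memory_bytes)     # virtual memory size
        _cap(resource.RLIMIT_FSIZE, file_bytes)    # size of written files

    return subprocess.run([sys.executable, "-c", code],
                          preexec_fn=set_limits, capture_output=True)


result = run_limited("while True: pass")
# A negative return code means the child was terminated by a signal,
# here because it exceeded its CPU time limit
terminated = result.returncode < 0
```

The calling process can inspect the return code to detect such a termination and then enqueue the corresponding email to the student.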

We address the potential that submitted code could attempt to maliciously access data about another student (or parts of the system) with a multi-faceted approach:

1) We execute the tests on the student code under a separate local user account on the server that performs the tests. This account has minimal permissions on the file system.

2) We create a separate directory for each submission that we test, and run the tests within this directory.

3) The result of the two previous points, assuming that all relevant file system permissions are configured correctly, is that no student submission may read or modify any other student's submissions or marks, nor can it read the code comprising the testing system.

4) The environment variables available to processes running as the test user are limited to a small set of pre-defined variables, so that no sensitive data will be disclosed through that mechanism.

5) We do not provide the students with information about the file system layout, local account names, etc. on the host that runs the tests, to reduce the chance that students know of the locations of sensitive data on the file system.
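Points 2) and 4) of the list above can be illustrated with the standard `subprocess` module. The sketch does not switch to a separate user account (point 1), which requires administrative privileges, and the list of permitted variables is our assumption; the source does not state which variables are passed on.

```python
import os
import subprocess
import sys
import tempfile

# Assumed whitelist of variable names; everything else is withheld.
# LD_LIBRARY_PATH is kept only so that the interpreter itself can start
# on hosts that rely on it.
ALLOWED_VARS = ("PATH", "LANG", "LC_ALL", "LD_LIBRARY_PATH")


def run_in_submission_dir(code):
    """Run code in a fresh directory with a minimal environment."""
    workdir = tempfile.mkdtemp(prefix="submission-")  # created with mode 0o700
    env = {name: os.environ[name]
           for name in ALLOWED_VARS if name in os.environ}
    return subprocess.run([sys.executable, "-c", code], cwd=workdir,
                          env=env, capture_output=True, text=True)


os.environ["SECRET_TOKEN"] = "do-not-leak"  # simulates sensitive server state
result = run_in_submission_dir(
    "import os; print('SECRET_TOKEN' in os.environ)")
```

The child process reports that the variable is absent, since only the whitelisted names are copied into its environment.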

3.4.2 Iterative testing of student code 

We have split the exercises on our courses into 
questions, and arranged to test each question sep¬ 
arately. Within a question, the testing process stops 
if any of the test criteria are not satisfied. This 
approach was picked to encourage an iterative 
process whereby students are guided to focus on 
one mistake at a time, correct it, and get further 
feedback, which improves the learning experience. 
This approach is similar to that taken by Tillmann 
et al. 12^ , where the iterative process of supplying 
code that works towards the behaviour of a model 
solution for a given exercise is so close to gaming 
that it "is viewed by users as a game, with a 
byproduct of learning". Our process resembles
test-driven development strategies and familiarises the
students with test-driven development [30] in a
practical way.

3.4.3 Defining the tests 

There are an indefinite number of both correct and 
incorrect ways to answer an exercise, and to test 
correctness using a regression testing framework 
requires some skill and experience in constructing 
a suitably rigorous test case for the exercise. We 
build on our experience before and after the in¬ 
troduction of the testing system, ongoing feedback 
from interacting with the students and reviewing 
their submissions to design the best possible unit 
testing for the learning experience. This includes 
testing for correctness but also structuring tests in a 
didactically meaningful order. Comments added in 
the testing code will be visible to the students when 
a test fails, and can be used to provide guidance to 
the learners as to what is tested for, and what the 
cause of any failure may be (if desired). 


Considering question 3) in the example exercise
we introduced in Section 3.2, the tests that we carry
out on the student's function include the following:


1) Volume must be 0 when h is 0. 

2) Volume must be 0 when A is 0. 

3) If we have A = 1 and h = 3, volume must 
be 1. 

4) If we have A = 3 and h = 1, volume must 
be 1. 

5) If we have A = 1.0 (as a float) and h = 1.0
(as a float), volume must be 1/3.

6) If we test another combination of values of
floating-point numbers A and h then the returned
volume must be A * h / 3.0.

7) If we have A = 1 (as an integer) and h = 1
(as an integer), volume must be 1/3.

8) The function must have a documentation
string; this must contain several words, one
of which is "return".

In this very simple example, we set up the first 
group of criteria (1-6) to determine that the student 
has implemented the correct formula to solve the 
problem at hand. Criterion 7 tests for the common
mistake of using integer division where floating-point
division is required. The final criterion concerns
coding style. In this example, it is a strict requirement
that the code is documented to at least some
minimal standard, and the student will gain no 
marks for a question that is answered without a 
suitable documentation string. 

Our implementation of the tests described above
is given in Listing 5. In implementing these criteria,
we avoid testing for exact equality of floating-point
numbers at any point in the testing process. Instead
we define some tolerance (e.g. eps = 1e-14), and
require that the magnitude of the difference between
the result of the student's code and the required
answer be below this tolerance. This avoids
failing student submissions which have, for example,
performed accumulation operations in a different
order and consequently suffered differing floating-point
round-off effects. As exercises become more
complex and related to numerical methods, a different
tolerance may have to be chosen.
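To illustrate why a tolerance is needed, the following sketch adds the same three numbers in two different orders. The helper name close_enough is hypothetical; the use of math.isclose with a relative tolerance is shown only as one possible choice when results are much larger or smaller than 1, and the value 1e-12 is an assumption.

```python
import math


def close_enough(result, expected, eps=1e-14):
    """Absolute-tolerance comparison as described in the text."""
    return abs(result - expected) < eps


# The same sum accumulated in two different orders.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)

exact_match = left_to_right == right_to_left            # round-off differs
within_tolerance = close_enough(left_to_right, right_to_left)

# A relative tolerance scales with the magnitude of the values compared.
relative_match = math.isclose(left_to_right, right_to_left, rel_tol=1e-12)
```

An exact comparison would reject one of two mathematically equivalent solutions here, whereas both tolerance-based comparisons accept them.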

def test_pyramid_volume():

    # if height h is zero, expect volume zero
    assert s.pyramid_volume(1.0, 0.0) == 0.

    # if base A is zero, expect volume zero
    assert s.pyramid_volume(0.0, 1.0) == 0.

    # if base has area A=1, and the height is h=3,
    # we expect a volume of 1:
    assert s.pyramid_volume(1.0, 3.0) == 1.

    # if base has area A=3, and the height is h=1,
    # we expect a volume of 1:
    assert s.pyramid_volume(3.0, 1.0) == 1.

    # acceptable tolerance for floating point answers
    eps = 1e-14

    # if base has area A=1, and the height is h=1,
    # we expect a volume of 1/3.:
    assert abs(s.pyramid_volume(1., 1.) - 1./3.) < eps

    # another example
    h = 2.
    A = 4.
    assert abs(s.pyramid_volume(A, h) -
               correct_pyramid_volume(A, h)) < eps

    # does this also work if arguments are integers?
    eps = 1e-14
    assert abs(s.pyramid_volume(1, 1) - 1./3.) < eps

    # is the function documented well
    docstring_test(s.pyramid_volume)

Listing 5: testing code for example question

We order the criteria so those that are most
likely to pass are tested earlier, and we have chosen
to stop the testing process at the first error
encountered. This encourages students to address
and correct one error at a time in an iterative
process, if required, which is possible thanks to
the short timescale between their submitting work
and receiving feedback (see Sec. 3.4.2). The
implementation of the tests for py.test is based
on assert statements, which are True when the
student's code passes the relevant test, and False
otherwise. The final criterion, that the documentation
string must exist and pass certain tests, is handled
by asserting that a custom function that we provide
to check the documentation string returns True. Of
course, the tests must be developed carefully to suit
the exercise they apply to, and to exercise any likely
weaknesses in the students' answers, such as the
chance that integer division would be used in the
implementation of the formula for the volume of a
pyramid discussed above.
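The custom documentation-string check is not shown in the paper. A hypothetical helper that follows criterion 8 (a documentation string of several words, one of which is "return") might look as follows. The name docstring_ok, the minimum of three words and the substring match (so that "returns" also counts) are assumptions.

```python
def docstring_ok(func, min_words=3):
    """Return True if func has a docstring of several words mentioning 'return'."""
    doc = func.__doc__
    if not doc:
        return False
    words = doc.lower().split()
    if len(words) < min_words:
        return False
    return any("return" in word for word in words)


def pyramid_volume(A, h):
    """Return the volume of a pyramid with base area A and height h."""
    return A * h / 3.0


def undocumented(A, h):
    return A * h / 3.0
```

A test function could then contain a line such as `assert docstring_ok(pyramid_volume)`, so that a missing or too-short documentation string fails the question in the same way as a wrong numerical result.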


3.4.4 Clean code and PEP 8

In addition to the hard syntactic requirements of
a programming language, there are often recommendations
on how to style and lay out source code.
We find that it is very efficient to introduce this to
students from the very beginning of their programming
learning journey.

For Python, the so-called "PEP 8 Style Guide" for
Python Code [31] is useful guidance, and electronic
tools are available to check that code follows these
voluntary recommendations for clean code. PEP 8
has recommendations for the number of spaces
around operators, before and after commas, the
number of empty lines between functions, class
definitions, etc.

We use the pep8 utility [32] to assess the conformance
of the student's entire submission file
(which will usually consist of answers to several
questions like the above) with the PEP 8 Style
Guide. Our system counts the number of errors that
are found, $N_\mathrm{err}$, and penalises the student's total
score according to a policy (e.g. we may choose a
policy of multiplying the raw mark that could be
obtained for full PEP 8 conformance by $2^{-N_\mathrm{err}}$, or of
implementing any other desired mark adjustment
as a function of that value).
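As a small worked example of such a policy, the sketch below halves the raw mark once for every style error found, which is one reading of the example policy mentioned above. The function name and the choice to reject negative error counts are assumptions; counting the errors themselves is left to the style checker.

```python
def pep8_adjusted_mark(raw_mark, n_err):
    """Multiply the raw mark by 2**(-n_err), i.e. halve it per style error."""
    if n_err < 0:
        raise ValueError("n_err must be non-negative")
    return raw_mark * 2.0 ** (-n_err)


full_marks = pep8_adjusted_mark(4.0, 0)   # no style errors: mark unchanged
two_errors = pep8_adjusted_mark(4.0, 2)   # two errors: a quarter of the mark
```

Because the penalty is a plain function of the error count, a gentler policy can be substituted without touching the rest of the marking.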

3.4.5 Results and feedback provision to student 

The results of the testing process are written to
machine-readable files by py.test. For each tested
submission, the report is parsed by our system,
with one of a number of results being possible: the
student code may have run completely, in which
case we have a pass result or a fail result for each of
the defined tests. Otherwise the student code may
have terminated with an error, which is most likely
due to a resource limit being exceeded causing the
operating system to abort the process, as discussed
in Section 3.4.1.
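The paper does not say which machine-readable format is used. One format that py.test can produce is JUnit-style XML (via its --junitxml option), and a report of that kind could be parsed as sketched below. The function name and the example report are made up for illustration.

```python
import xml.etree.ElementTree as ET


def summarise_report(xml_text):
    """Map each test name to 'pass' or 'fail' from a JUnit-style XML report."""
    root = ET.fromstring(xml_text.strip())
    results = {}
    for case in root.iter("testcase"):
        failed = (case.find("failure") is not None
                  or case.find("error") is not None)
        results[case.get("name")] = "fail" if failed else "pass"
    return results


# Made-up report with one passing and one failing test.
EXAMPLE_REPORT = """
<testsuite name="pytest" tests="2" failures="1">
  <testcase classname="test_lab" name="test_hello_world"/>
  <testcase classname="test_lab" name="test_pyramid_volume">
    <failure message="assert 0 == 1">traceback text</failure>
  </testcase>
</testsuite>
"""

summary = summarise_report(EXAMPLE_REPORT)
```

If the process was aborted by the operating system, no complete report exists, so that case has to be detected separately, for example from the exit status of the test run.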

The number of questions that were answered correctly
(i.e. have no failed assertions in the associated
tests) is counted and stored in a database. If there
were incorrect answers, we extract a backtrace from
the py.test output which we incorporate into the
email that is sent to the student. The general
format of the results email is to give a per-question
mark, with a total mark for the submission, and
then to detail any errors that were encountered.
In the calculation of the mark for the assessment,
questions can be given different weights to reflect
greater importance or challenges of particular
questions. For the example shown in the results
listing, all questions have the same weight of 1.
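The weighted mark calculation can be sketched as follows. The function name, the dictionary layout and the question names are hypothetical; only the idea of per-question weights with a default weight of 1 comes from the text.

```python
def total_mark(question_results, weights=None):
    """Return (achieved, available) marks.

    question_results maps a question name to True (all tests passed) or False;
    weights maps a question name to its weight (default: 1 for every question).
    """
    if weights is None:
        weights = {name: 1.0 for name in question_results}
    achieved = sum(weights[q] for q, passed in question_results.items() if passed)
    available = sum(weights[q] for q in question_results)
    return achieved, available


results = {"q1": True, "q2": False, "q3": True, "q4": True}   # made-up outcome
equal_weights = total_mark(results)
harder_q2 = total_mark(results, {"q1": 1.0, "q2": 2.0, "q3": 1.0, "q4": 1.0})
```

With equal weights this student scores 3 out of 4; giving the failed question a weight of 2 changes the result to 3 out of 5.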

We described and illustrated a typical question,
which might form part of an assignment, in Section 3.2.
As shown in the listing of the results email,
when an error is encountered, the results that are
sent to the student include the portions of the
testing code for the question in which the error was
found that have passed successfully, and then
indicate with the > character the line whose
assertion failed (in this case the 7th-last line shown).
This is followed by a backtrace which illustrates that,
in this case, the submitted pyramid_volume function
returned 0 when it was expected to return an answer of
$\frac{1}{3} \pm 1 \times 10^{-14}$. The report also includes several
comments, which are introduced in the testing code
(shown in Listing 5), and assist students in working
out what was being tested when the error was
found. Here, the comment "does this also work if
arguments are integers?" shows the student that we
are about to test their work with integer parameters;
that should prompt them to check for integer
division operations. If they do not succeed in doing
this, they will be able to show their feedback to
a demonstrator or academic, who can use the
feedback to locate the error in the student's code
swiftly, then help the student find the problem, and
discuss ways to improve the code.


3.4.6 Statistical reporting to lecturers and routine
performance feedback to student

The system records all pertinent data about each
submission including the user who made the submission,
as well as the date and time of the submission,
and the mark awarded. We use this data to
further engage students with the learning process,
by sending out a weekly email summary of their
performance to date, as shown in Listing 6. This
includes a line for each exercise whose deadline
has passed, which reminds the student of their
mark and whether their submission was on time
or not. For a student who has submitted no work,
a different reminder is sent out, requesting they
submit work, and giving contact details of the
course leader, asking them to make contact if they
are experiencing problems. Messages are sent via
the outgoing queue (Sec. 3.3.2).
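The figures in the weekly summary can be reproduced with a few lines of code. The sketch below uses the values from Listing 6 and assumes, as that listing suggests for lab 4, that a late submission counts as 0%; the function names and the tuple layout are hypothetical.

```python
def lab_percentage(mark, max_mark, on_time):
    """Percentage for one lab; a late submission is assumed to count as 0%."""
    if not on_time:
        return 0.0
    return 100.0 * mark / max_mark


def average_percentage(labs):
    """labs is a list of (mark, max_mark, on_time) tuples."""
    percentages = [lab_percentage(*lab) for lab in labs]
    return sum(percentages) / len(percentages)


# Values taken from Listing 6 (labs 2 to 7).
labs = [
    (1.00, 4.00, True),
    (1.25, 4.00, True),
    (4.00, 4.00, False),   # full marks, but submitted late
    (4.00, 5.00, True),
    (3.06, 4.00, True),
    (3.00, 4.00, True),
]
average = round(average_percentage(labs))
```

The rounded average agrees with the 48% stated at the end of the example email.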

We also monitor missing submissions in the first
couple of weeks very carefully and contact students
individually who appear not to have submitted
any work. Occasionally, they are registered
on the wrong course, but equally some students
just need a little bit of extra help with their first
ever programming exercises. By expecting the
first submitted work at the end of the first or
second teaching week, we can intervene early in
the semester and help those students get started
with the exercises and follow the remainder of the
course.

Dear Neil O'Brien,

Please find below your summary of submissions and
preliminary marks for the weekly laboratory sessions
for course ABC, as of Fri Jan 30 17:06:44 2015.

lab 2 : 25%  Details: 1.00 / 4.00, submitted before deadline
lab 3 : 31%  Details: 1.25 / 4.00, submitted before deadline
lab 4 :  0%  Details: 4.00 / 4.00, but submission
             at 2014-11-14 20:39:02 was
             late by 4:39:02.
lab 5 : 80%  Details: 4.00 / 5.00, submitted before deadline
lab 6 : 77%  Details: 3.06 / 4.00, submitted before deadline
lab 7 : 75%  Details: 3.00 / 4.00, submitted before deadline

The average mark over the listed labs is 48%.

With kind regards,

The teaching team (course-help@uni.email.address)

Listing 6: Typical routine feedback email

After the deadline for each set of exercises, the
course lecturers will generally flick through the
code that students have submitted (or at least 10 to
20 randomly chosen submissions if the number of
students is large). This helps the teacher in identifying
typical patterns and mistakes in the students'
solutions, which can be discussed, analysed and
improved effectively in the next lecture: once all
student specific details are removed from the code
(such as name, login and email address), submitted
(and anonymised) code can be shown in the next
lecture. We find that students clearly enjoy this kind
of discussion and code review jointly carried out
by students and lecturer in the lecture theatre, in
particular where there is the possibility that their
anonymised code is being shown (although only
they would know).

The data for the performance of the whole class
is made available to the lecturer through private
web pages which allow quick navigation to each
student and all their submissions, files and results.
Key data is also made available as a spreadsheet,
and a number of graphs showing the submission
activity (some are shown in Figs. 3, 4 and 5 and
discussed in Section 4.2).

3.4.7 Dependability and resilience 

The submission system is a critical piece of infrastructure
for the delivery of those courses that have
adopted it as their marking and feedback system;
this means that its reliability and availability must
be maximised. We have taken several measures to
reduce the risk of downtime and service outages,
and also to reduce the risk of data loss to a low
level.

The machine on which the system is installed
is a virtual machine which is hosted on centrally
managed University infrastructure. This promises
good physical security for the host machines, and
high-availability features of the hypervisor, such
as live migration [33], improve resilience against
possible individual hardware failures. To combat
the possibility of the data (especially the student
submissions) being lost we have instituted a multi-tier
backup system, which backs up the system's
data to multiple physical locations and to multiple
destination storage media, so that the probability
of losing data should be very small.

The remaining potential single point of failure is
the University's email system, which is required for
any student to be able to submit work or receive
feedback. In the case that the email system were
to fail close to a deadline, we would have the
choice of extending the deadline to allow submissions
after the service was restored, and/or manual
intervention to update marks where students could
demonstrate that their submission was ready on
time, depending on the lecturer's chosen policy.

The internal architecture of the testing system
was designed to be as resilient as possible, and
to limit the potential impact of any faults. A key
approach to this goal is the use of various (file
system based) queues (Fig. 1) that decouple the
different stages of submission handling and testing
so that e.g. a failure of the system's ability to
deliver emails would not impede testing submissions
already received. Emails are received into
a local mailbox and are processed one item at a
time so this is the first effective queue; receipt of
emails can continue even if the testing process has
halted. Valid submissions from processed emails
are then entered into a queue for testing, the entries
of which are processed sequentially. The receipt
and testing processes generate emails, which are
placed into an outgoing mail queue and are sent
regularly, the queue items being removed only after
successful transmission. This way, if the outgoing
email service is unavailable, mails will accumulate
in the queue and be sent en masse when the service
is restored.
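The behaviour of the outgoing queue, where items are removed only after successful transmission, can be sketched with one file per queued message. The directory layout, the function names and the use of OSError to signal an unavailable mail service are assumptions made for this illustration.

```python
import os
import tempfile


def enqueue(queue_dir, name, message):
    """Write the message to a hidden temporary file, then rename it into the queue."""
    tmp_path = os.path.join(queue_dir, "." + name + ".tmp")
    with open(tmp_path, "w") as f:
        f.write(message)
    os.rename(tmp_path, os.path.join(queue_dir, name))


def flush(queue_dir, send):
    """Send queued messages in order; delete an item only after send() succeeds."""
    sent = 0
    for name in sorted(os.listdir(queue_dir)):
        if name.startswith("."):
            continue  # skip items that are still being written
        path = os.path.join(queue_dir, name)
        with open(path) as f:
            message = f.read()
        try:
            send(message)
        except OSError:
            break  # service unavailable: keep this and all later items queued
        os.remove(path)
        sent += 1
    return sent


def unavailable(message):
    raise OSError("mail service down")


# Usage: messages accumulate while sending fails and go out once it works again.
queue_dir = tempfile.mkdtemp()
enqueue(queue_dir, "0001", "feedback A")
enqueue(queue_dir, "0002", "feedback B")
sent_while_down = flush(queue_dir, unavailable)
still_queued = sorted(os.listdir(queue_dir))
delivered = []
sent_after_restore = flush(queue_dir, delivered.append)
```

Writing to a temporary name first means that the sending process never picks up a half-written message.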

Another key design decision was that each individual
part of the receipt and testing process is
carried out sequentially for each submission and is
protected by lock files. Prior to processing received
messages, the system checks for existing locks; if
these exist, the processing does not start, and the
event is logged (receipt of other emails continues
as the receiving and processing are separate processes).
If no locks are found, a lock is created,
which is removed upon successful processing; any
unexpected termination of the processing code will
result in a lock file being left behind, so that we
can investigate what went wrong and make any
required corrections before restarting the system.
The testing process itself is likewise protected by
locking. A separate watchdog process alerts the
administrator if lock files have stayed in place for
more than a few minutes; typically each process
completes within a minute.
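The locking scheme can be sketched as follows. The file names, the function names and the five-minute threshold are assumptions; the essential points from the text are that the lock is removed only after successful processing and that a separate check reports locks that have been in place for too long.

```python
import os
import tempfile
import time


def acquire_lock(lock_path):
    """Create the lock file atomically; return False if it already exists."""
    try:
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        f.write(str(os.getpid()))
    return True


def process_with_lock(lock_path, work):
    """Run work() under the lock; the lock is removed only on success."""
    if not acquire_lock(lock_path):
        return "locked"      # a real system would also log this event
    work()                   # an exception here leaves the lock file behind
    os.remove(lock_path)
    return "done"


def lock_is_stale(lock_path, max_age_seconds=300):
    """Watchdog check: has the lock been in place longer than expected?"""
    try:
        age = time.time() - os.path.getmtime(lock_path)
    except FileNotFoundError:
        return False
    return age > max_age_seconds


# Usage: a crash leaves the lock behind, which blocks further processing.
lock = os.path.join(tempfile.mkdtemp(), "processing.lock")
first = process_with_lock(lock, lambda: None)
try:
    process_with_lock(lock, lambda: 1 / 0)   # simulated unexpected failure
except ZeroDivisionError:
    pass
second = process_with_lock(lock, lambda: None)
```

Leaving the lock in place after a failure is deliberate: processing stays halted until someone has inspected the problem, while the queues continue to accept new items.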

In practice we have developed the system to the
point that we have not had an unexpected failure
require us to manually clean up and unlock in two
years of production use, but should an unexpected
bug be found, this design ensures that at most one
submission will be affected (a copy will have been
made before any processing was carried out, so
even in this case there would be no loss of student
data).

4 Results 

4.1 Testing system deployment 

The automatic testing system was first used at the
University of Southampton's Highfield Campus in
the academic year 2009/2010 for teaching about 85
Aerospace engineers, and has been used every year
since for growing student numbers, reaching 425
students in 2014/2015. The Southampton deployment
now additionally serves another cohort of students
who study at the University of Southampton
Malaysia Campus (USMC) and there is a further
deployment at the Indian Institute of Technology
(IIT) Mandi and Madras campuses, where the system
has been integrated with the Moodle learning
management system, as described later in Section 4.

The testing system has also been used in a number
of smaller courses at Southampton, typically
of approximately 20 students, such as one-week
intensive Python programming courses offered to
PhD students. It also serves Southampton's courses
in advanced computational methods where around
100 students have submitted assignments in C, as
described later in Section 4.

Training Assignment

Voluntary, formative feedback & assessment
(not contributing to final mark)

Exercises: Question T1, Question T2, ...

Laboratory Assignment

Compulsory, summative feedback & assessment
(first submission contributing to final mark)

Exercises: Question L1, Question L2, ...

Fig. 2: Overview of the structure of the weekly com¬ 
puter laboratory session: A voluntary set of training 
exercises is offered to the students as a "training" 
assignment on which they receive feedback and a 
mark, followed by a compulsory set of exercises in 
the same topic area as the "laboratory" assignment 
which is marked and contributes to each student's 
final mark for the course. Automatic feedback is 
provided for both assignments and repeat submis¬ 
sions are invited. 


4.2 Case study: Introduction to Computing 

In this section, we present and discuss experience 
and pertinent statistics from the production usage 
of the system in teaching our first-year computing 
course, in which programming is a key component. 
In 2014/15, there were about 425 students in their 
first semester of studying Acoustic Engineering, 
Aerospace Engineering, Mechanical Engineering, 
and Ship Science. 

4.2.1 Course structure 

The course is delivered through weekly lectures
(Sec. 2.2) and weekly self-paced student exercises
(Sec. 2.3) with a completion deadline a day before
the next lecture takes place (to allow the lecturer
to sight submissions and provide generic feedback
in the lecture the next day). Students are offered
a 90 minute slot (which is called "computing lab¬ 
oratory" in Southampton) in which they can carry 
out the exercises, and teaching staff are available to 
provide help. Students are allowed and able to start 
the exercise before that laboratory session, and use 
the submission and testing system anytime before, 
during and after that 90 minute slot. 


Each weekly exercise is split into two assignments:
a set of "training" exercises and a set of
assessed "laboratory" exercises. This is summarised
in Fig. 2.

The training assignment is checked for correct¬ 
ness and marked using the automatic system, but 
whilst we record the results and feed back to the 
students, they do not influence the students' grades 
for the course. Training exercises are voluntary 
but the students are encouraged to complete them 
in order to practice the skills they are currently 
learning and prepare for the following assessed 
exercise which tests broadly similar skills. 

Students can repeatedly re-submit their (modified)
code, for example until they have removed all
errors from the code, or they may wish to submit
different implementations to get feedback on those.

The assessed laboratory assignment is the second
part of each week's exercises. For these, the
students attempt to develop a solution as perfect
as possible before submitting this by email to the 
testing system. This "laboratory" submission is as¬ 
sessed and marks and feedback are provided to the 
student. These marks are recorded as the student's 
mark for that week's exercises, and contribute to 
the final course mark. The student is allowed (and 
encouraged) to submit further solutions, which will 
be assessed and feedback provided, but it is the first 
submission that is recorded as the student's mark 
for that laboratory. 

The main assessment of the course is done 
through a programming exam at the end of the 
semester in which students write code on a com¬ 
puter in a 90 minute session, without Internet ac¬ 
cess but having an editor and Python interpreter to 
test the code they write. Each weekly assignment 
contributes of the order of one percent to the final 
mark, i.e. 10% overall for a 10 week course. Each 
laboratory session can be seen as a training oppor¬ 
tunity for the exam as the format and expectations 
are similar. 

4.2.2 Student behaviour: exploiting learning oppor¬ 
tunities from multiple submissions 

In Figure 3a we illustrate the distribution of submission
counts for "training 2", which is the voluntary
set of exercises from week 2 of the course.

Fig. 3: Histogram illustrating the distribution of
submission counts per student for the (a) voluntary
training and (b) assessed laboratory assignment
(see text in Sec. 4.2.2).

The bar labelled 1 with height 92 shows that 92
students have submitted the training assignment
exactly once, the bar labelled 2 shows that 76 students
submitted their training assignment exactly
twice, and so on. The sum over all bars is 316 and
shows the total number of students participating
in this voluntary training assignment. 87 students
submitted four or more times, and several students
submitted 10 or more times. This illustrates that
our concept of students being free to make step-wise
improvements where needed and rapidly get
further feedback has been successfully realised.
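A histogram of this kind can be derived directly from the submission log. The sketch below uses made-up student names; the function name and the input format (one student identifier per received submission) are assumptions.

```python
from collections import Counter


def submission_histogram(submitters):
    """Return {number of submissions: number of students who submitted that often}.

    submitters holds one student identifier per submission received.
    """
    per_student = Counter(submitters)
    return dict(Counter(per_student.values()))


# Made-up log: ann submitted three times, bob twice, cat and dan once each.
submitters = ["ann", "bob", "ann", "cat", "ann", "bob", "dan"]
histogram = submission_histogram(submitters)
participants = sum(histogram.values())
```

The sum over all bars gives the number of participating students, which is how the total of 316 quoted above is obtained from the real data.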

We can contrast this to Figure 3b which shows
the same data for the compulsory laboratory assignment
in week 2 ("lab2"). This submission attracts
marks which contribute to the students' overall
grades for the course. In this case the students are
advised that while they are free to submit multiple
times for further feedback, only the mark recorded
for their first submission will count towards their
score for the course. For lab 2, 423 students submitted
work, of whom 314 submitted once only.
However, 64 students submitted a second revised
version and a significant minority of 45 students
submitted three or more times to avail themselves
of the benefits of further feedback after revising
their submissions, even though the subsequent
submissions do not affect their mark.

Significant numbers of students choose to submit
their work for both voluntary and compulsory assignments
repeatedly, demonstrating that the system
offers the students an extended learning opportunity
that the conventional cycle of submitting
work once, having it marked once by a human, and
moving to the next exercise, does not provide.

The proportion of students submitting multiple
times for the assessed laboratory assignment
(Figure 3b) is smaller than for the training exercise
(Figure 3a) and likely to highlight the difference
between the students' approaches to formative and
summative assessment. It is also possible that students
need more iterations to learn new concepts
in the training assignment before applying the new
knowledge in the laboratory assignment, contributing
to the difference in resubmissions. The larger
number of students submitting for the assessed
assignment (423 ≈ 100%) over the number of students
submitting for the training assignment (316 ≈ 74%)
shows that the incentive of having a mark contribute
to their overall grade is a powerful one.


4.2.3 Student behaviour: timing of submissions 

In Figure 4 we show the submission timelines for
all the voluntary "training" assignments that the
students were offered every week. There are ten
such scheduled assignments in total, and for each
a line is shown. The assignments may be identified
by their chronological sequence, as discussed in the
following paragraphs.

In Figure 5 we show the same data but for the
compulsory and assessed laboratory assignments
(see Fig. 2 and Sec. 4.2.1 for a detailed explanation
of the "training" and "laboratory" assignments).

Plot a) in Figure 4 and a) in Figure 5 show
the "unique" student submission counts for every
exercise. By unique, we mean that only the first
submission that any individual student makes for
a given assignment is counted in the graph. In
contrast, subplots b) in Figure 4 and b) in Figure 5
show the "non-unique" submissions that include
every submission made, even repeat submissions
from any particular student for the same assignment.

(a) Unique submissions of voluntary training exercises, showing the number of students participating
as a function of time.

(b) Non-unique submissions of voluntary training exercises, showing the total number of submissions
as a function of time.

Fig. 4: Submissions of voluntary training assignments as a function of time for (a) unique student
participation for each assignment, (b) total number of submissions for each assignment. Labels L1...L10
and associated dashed vertical lines indicate time-tabled computing laboratory sessions 1 to 10 at
Southampton; EA - end of autumn term; SS - start of spring term (Christmas break is between these
dates); EX - exam. (See Sec. 4.2.3 for details.)

(a) Unique submissions of compulsory assessed laboratory assignments, showing the number of
students participating as a function of time.

(b) Non-unique submissions of compulsory assessed laboratory assignments, showing the total
number of submissions as a function of time.

Fig. 5: Submissions of compulsory and assessed laboratory assignments as a function of time for (a)
unique student participation for each assignment, (b) total number of submissions for each assignment.
Labels as in Fig. 4. Additional labels S2 to S10 (vertical dashed lines) indicate submission deadlines in
Southampton; M2 to M10 (vertical dash-dotted lines in lower part of plot) show submission deadlines
for students in Malaysia.

The unique plots allow us to gauge the total
number of students submitting work to a given
assignment (as a function of time), and the non-unique
plots allow us to see the total number of
submissions made by the entirety of the student
body together.
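The difference between the unique and non-unique counts can be made concrete with a short sketch. It assumes that the plotted curves are running (cumulative) totals over time; the function name, the integer timestamps and the student names are made up.

```python
def cumulative_counts(submissions):
    """Return (unique, non_unique) running totals for one assignment.

    submissions is a list of (timestamp, student) pairs. Each result is a list
    of (timestamp, count) points: unique counts only a student's first
    submission, non_unique counts every submission.
    """
    seen = set()
    unique, non_unique = [], []
    for n, (when, student) in enumerate(sorted(submissions), start=1):
        non_unique.append((when, n))
        if student not in seen:
            seen.add(student)
            unique.append((when, len(seen)))
    return unique, non_unique


# Made-up log: ann submits three times, bob and cat once each.
log = [(1, "ann"), (2, "bob"), (3, "ann"), (5, "cat"), (8, "ann")]
unique, non_unique = cumulative_counts(log)
```

In this example the unique curve ends at 3 (students) while the non-unique curve ends at 5 (submissions); the gap between the two reflects repeat submissions.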

We discuss the labels and annotations in Figure 4
first, but they apply similarly to Figure 5. The
dashed vertical lines represent time-tabled computing
laboratory sessions lasting 90 minutes where
the students are invited to carry out the voluntary
training and assessed laboratory assignment for
that week in the presence of and with support from
teaching staff. These time-tabled sessions, in which
every student has a computer available to write
their code, are labelled L1 to L10 in the figures.

The coloured symbols which are connected by
straight lines count the number of submissions.
In Figure 4 a), the "training 1" assignment submissions
are shown in blue, the next week's "training
2" assignment submissions are shown in green,
etc. There are ten scheduled laboratory sessions,
and ten associated voluntary training assignments.
There is one additional assignment at the end of
the course which is offered to help revision for the
exam, shown on the very right in Figure 4 a) and
b) in yellow without symbols.

Figure 5 shows submission counts for the compulsory
assessed laboratory exercises. There were 9
such assessed assignments, starting in the second
week of the course: while there is a voluntary
training assignment in week 1, there is no assessed
laboratory assignment, to give the students some
time to familiarise themselves
with the teaching material and submission system.
From week 2 onward, there is one voluntary training
assignment and one assessed assignment every
week up to and including week 10. The students
were given submission deadlines for the assessed
laboratory assignments, and these deadlines for
Southampton students are shown in the plots as
dotted vertical lines labelled S2 to S10.

The course was delivered simultaneously at
the University of Southampton (UoS) Highfield
Campus in the United Kingdom, where about
400 students were taught, and the University
of Southampton Malaysia Campus (USMC) in
Malaysia, where a smaller group of about 25
students was taught. While following the same
lecture material and assignments, these two campuses,
due to different local arrangements and time
zones, taught the course to different schedules, and
the effect of this division is visible in all of the
figures. The Malaysia students have different deadlines
from the Southampton students, and these are
shown as shorter vertical dash-dotted lines, labelled
M2 to M10, towards the bottom of each plot
in Figure 5. The deadlines of students in Malaysia
(M) and Southampton (S) follow local holidays and
other constraints, although they often fully coincide
(S6 to S9), or are delayed by one week (S2 to S5).

We now discuss the actual data presented, starting
with the voluntary training assignment submissions
in Figure 4. Looking at Figure 4, we see that
the first training exercise had the largest number of
submissions of any of the training exercises. About
300 of these submissions occurred during the first
hands-on taught session L1, reflecting a large number
of students who followed the recommended
learning procedure of completing the voluntary
exercises and doing so during the computing laboratory
session in the presence of teaching staff,
and who had sufficient resource and instruction
available to do so.

The corresponding burst of submissions during
the computing laboratory sessions L2 and L3 has
decreased to about 175 and 150 submissions,
respectively. The total number of students
participating in these voluntary assignments in
the first three weeks decreases from about 400 in
week 1, to about 310 and about 270 in weeks 2 and
3, respectively. The total number of unique
submissions reaches its minimum of about 80 in
week 4 (the purple data set associated with
hands-on session L4), and then starts to increase
again for the remainder of the course.
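
To make the distinction between unique and repeat
submissions concrete, the following sketch counts
both from a submission log. It assumes that a
"unique" submission is the first submission of a
given assignment by a given student; the function
name and the (student, assignment) record layout
are hypothetical and chosen for illustration only.

```python
from collections import Counter


def count_submissions(log):
    """Split a submission log into first-time and repeat counts.

    `log` is an iterable of (student_id, assignment) tuples in
    chronological order; this record layout is an assumption.
    """
    seen = set()
    unique = Counter()
    repeat = Counter()
    for student_id, assignment in log:
        key = (student_id, assignment)
        if key in seen:
            # The student has submitted this assignment before.
            repeat[assignment] += 1
        else:
            seen.add(key)
            unique[assignment] += 1
    return unique, repeat
```

Counting per assignment in this way yields one
data set per exercise, which is the granularity at
which the submission counts are discussed here.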

We see in Figure 5 a) that the compulsory
submissions remain high, so that this drop in the
voluntary submissions is no reason for concern, and
may reflect that students understand the learning
methods, and options for learning activities that
suit their own preferences and strengths. The data
may also suggest an opportunity to make the
assignments slightly more challenging as students
seem to feel very confident in tackling them.

In addition to the burst of submissions during
the time-tabled sessions L1 to L10, we also note
a significant number of submissions both before
and after these sessions in panels a) and b) of
the figure, re-emphasising the flexibility that
the system affords students as to where and when
they submit their work. Anecdotal evidence and
written feedback from the students (Sec. 4.3),
combined with the submission data, suggest that
some students will do the exercises as soon as
they become available, and others prefer to do
this during the weekend or evening hours. Many
students see the offered computing laboratory
sessions as an opportunity to seek support, which
they make use of if they feel this will benefit
their learning.

The unique-submission panel shows that as the
examination date (labelled as E at the right-hand
side of the graphs) approached, a relatively small
number of students started to submit solutions to
the training exercises they had not submitted
before as part of their revision and exam
preparation. The same tendency is visible in the
non-unique panel, with a slightly larger increase
due to repeat submissions that cannot be seen in
the unique-submission graph.


We now discuss Figure 5, which shows the same
type of data as the voluntary-submission figure
but for the compulsory assessed laboratory
assignments rather than voluntary training
assignments. The most notable difference is that
the total number of submissions remained high for
all the assignments, reflecting that these
assignments are not voluntary and do contribute
to the final course mark. There is a slow decline
in submissions (from about 425 to 375 during the
course, corresponding to approximately 10%), which
is not unexpected and includes students leaving
their degree programme altogether, suspending for
health reasons, etc.

The vast majority of first submissions for the
compulsory laboratory assignments, which
contribute to the overall course marks, occur in
advance of the deadline, as illustrated in
Figure 5, where the deadlines are shown as vertical
dotted lines.

The assessed assignment timings in Figure 5
show that submissions take place in different
phases. The trend is visible in all the lines, but
most clearly where the submission deadlines in
Malaysia coincide with those in Southampton, i.e.
laboratory sessions 6, 7, 8 and 9: the first
submissions are received after the assignments
have been published, and then a steady stream of
submissions comes in, leading to an approximately
straight diagonal line in the figure. The second
set of submissions is received during the
associated laboratory session (shown as a dashed
line) where many students complete the work in the
timetabled session. Following that, there is again
a steady stream of submissions up to the actual
deadline (shown as a dotted line) where
submissions accumulate. Very few (first)
submissions are received after the deadline. The
submissions in the second phase can be used to
estimate student attendance in the laboratory
sessions (see discussion in 4.8.4).

The University of Southampton Christmas break
is also apparent, a period during which there
are few new unique submissions (Fig. 5 a), but
slightly more new non-unique submissions
(Fig. 5 b) from students revising over the holiday
and re-submitting assignments they had submitted
before. It is reasonable to assume that they have
re-written the code as an exam preparation
exercise.

Trends seen in the voluntary submission data,
such as a notable rise in the non-unique
(i.e. repeat) submissions across all assignments in
the days leading up to the exam, are also evident
in Figure 5.


4.3 Feedback from students 

While overall ratings of our courses using the
automatic testing and feedback system are very
good, it is hard to distinguish the effect of the
testing system from that of, for example, an
enthusiastic team of teachers, who would also
achieve good ratings when using more conventional
assessment and feedback methods.

We invited feedback explicitly on the automatic
feedback system, asking for voluntary provision
of (i) reasons why students liked the system and
(ii) reasons why students disliked the system. The
replies are not homogeneous enough to compile
statistical summaries, but we provide a
representative selection of the comments we
received below.

4.3.1 I like the testing system because... 

The following items of feedback were given by the
students when invited to complete the sentence "I
like the testing system because..." as part of the
course evaluation:

1) because we can get quick feedback

2) it is very quick 

3) it provides a quick response 

4) immediate effect 

5) quick response 

6) it gives very quick feedback on whether code has 
the desired effect 

7) it provides speedy feedback, even if working at
home in the evening

8) it worked and you could submit and re-submit at
your own pace

9) I like the introduction to the idea of automated unit 
testing. 

10) concise, straight to the point, no mess, no fuss. 
"Got an error? Here's where it is. FIX IT!" 

11) it was easy to read output to find bugs in programs 

12) you can see where you went wrong 

13) very informative, quick response 

14) it reassures me quickly about what I do 

15) it gave quick feedback and allowed for quick
re-assessment once changes were made

16) feedback on quality of the code 

17) it is fast and easy to use 

18) it indicates where the errors are and we can submit 
our work as many times as we want 

19) it is quick and automatic 

20) it is automated and impartial 

21) gives quick feedback, for training lets you test 
things quickly 

22) it saves time and can give feedback very quickly. 
The re-submission of training exercises is very 
useful. 


We briefly summarise and discuss these points:
the most frequent student feedback is on the
immediate feedback that the system provides. Some
student comments mention explicitly the usefulness
of the system's feedback, which allows them to
identify the errors they have made more easily
(items 10, 11, 12). In addition to these generic
endorsements, some students mention explicitly
advantages of test-driven development such as
reassurance regarding correctness of code
(item 14), quick feedback on refactoring (item 15),
the indirect introduction of unit tests through
the system (item 9), and help in writing clean code
(item 16). It is worth noting that Agile methods
and test-driven development had not been
introduced to the students at the time when they
provided the above feedback. Further student
feedback welcomes the ability to re-submit code
repeatedly (items 8, 15, 22) and the flexibility
to do so at any time (item 7). Interestingly, one
student mentions the objectiveness of the system
(item 20) - presumably this comment is based on
experience with assessment systems where a set of
markers manually assess submissions, which
naturally display some variety in rigour and in
the application of marking guidelines.


4.3.2 I dislike the testing system because... 

The following items of feedback were given by the
students when invited to complete the sentence "I
dislike the testing system because..." as part of
the course evaluation:

1) error messages not easy to understand

2) it takes some time to understand how to interpret 
it 

3) sometimes difficult to understand what was wrong 

4) it complains (gives failures) for picky reasons like 
wrong function names and missing docstrings. 
That's not a complaint, it is only a machine. 

5) it is a bit unforgiving 

6) it is extremely [strict] about PEP 8 

7) tiny errors in functions would result in complete 
failure of test. 

Several comments (items 1 to 3) state that the
feedback from the automatic testing system is hard
to understand. This refers to test-failure reports
of the kind shown in the example listing. Indeed,
the learning curve at the beginning of the course
is quite steep: the first 90 minute lecture
introduces Python, Hello World and functions, and
demonstrates feedback from the testing system to
prepare students for their self-paced exercises
and the automatic feedback they will receive.
However, a systematic explanation of the assert
statements, True and False values, and exceptions
takes place only after the students have used the
testing system repeatedly. The reading of error
messages is of course a key skill (and the
importance of this is often underestimated by
these non-computer science students), and we like
to think that the early introduction of error
messages from the automatic testing is overall
quite useful. In practice, most students use the
hands-on computing laboratory sessions to learn
and understand the error messages with the help
of teaching staff before these are covered in
greater detail in the lectures. See also Sec.

A second set of comments relates to the harshness
and unforgiving nature of the automatic tests
(items 4 to 7). Item 7 refers to the assessment
method of not awarding any points for one of
multiple exercises that form an assignment if there
is any mistake in the exercise, and is a criticism
regarding the assessment as part of the learning
process.
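
The marking rule that item 7 criticises can be
sketched as follows. This is an illustration only:
the function name, the data layout and the point
values are assumptions made for the example and
are not taken from the marking code of the system.

```python
def mark_assignment(results, points):
    """Award an exercise's points only if all of its tests passed.

    results: dict mapping exercise name -> list of booleans,
             one per test (hypothetical layout)
    points:  dict mapping exercise name -> points available
    """
    marks = {}
    for exercise, outcomes in results.items():
        # A single failed test forfeits the whole exercise.
        passed_all = len(outcomes) > 0 and all(outcomes)
        marks[exercise] = points[exercise] if passed_all else 0
    return marks
```

With two exercises worth 3 and 2 points, a single
failing test in the second exercise leaves the
student with 3 of the 5 available points, however
small the mistake.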

For items 4 to 6 it is not clear whether these
statements relate to the feedback on the code or
the assessment. If the comments relate to the code,
then they reflect a lack of understanding (and thus
a shortcoming in our teaching) of the importance of
documenting code and the importance of getting
everything right in developing software (and not
just approximately right).


4.3.3 Generic comments 


The following comments on the feedback system
were provided by students unprompted, i.e. as part
of generic feedback on the course, and are in line
with the more detailed points made above:

1) Fantastic real-time feedback with online
submission of exercises.

2) Loved the online submission. 

3) Really like the online submission system with very 
quick feedback. 

4) Description in the feedback by automated system 
can be unclear. 

5) Instant feedback on lab and training exercises was 
welcome. 

6) Autotesting feature is VERY useful! Keep it and
extend it!

7) The automatic feedback is fairly useful, once you 
have worked out how to understand it. 


In the context of enthusiastic endorsements of
the testing system, we would like to add our
subjective observation from teaching the course
that many students seem to regard the process of
making their code pass the automatic tests as a
challenge or game which they play against the
testing system, and that they experience great
enjoyment when they pass all the tests - be it in
the first or a repeat submission (see also
Section 3.4.2). As students like this game, they
very much look forward to being able to start the
next set of exercises, which is a great motivation
to actively follow and participate in all the
teaching activities.


4.4 Issues 

During the years of using the automatic testing
system, we have experienced a number of issues
which are unique to the automated method of
assessment described here. We summarise them
and our response to each challenge below.

4.4.1 Submissions including syntax errors
When a student submits a file containing a syntax
error, our testing code (here driven by the py.test
framework) is unable to import the submission,
and therefore testing cannot commence.
Technically, such a submission is not a valid
Python program (because it contains at least one
syntax error). We ask the students to always test
their work thoroughly before submitting, which
should detect syntax errors first, and such
submissions should not occur.

However, in practice, and given the large number
of submissions (about 20 assignments per student,
and currently 500 students per year), occasionally
students will either forgo the testing to save
time, or will inadvertently introduce syntax errors
such as additional spaces or indentation between
checking their work and submitting it. From a
purely technical point of view, the system is able
to recognise this situation when it arises and we
could state that any such submission is incorrect,
and therefore assign a zero mark. However, these
submissions may represent significant effort and
contain a lot of valid code (for multiple exercises
submitted in one file), so we have adopted a policy
of allowing re-submission in such a scenario: if a
syntax error is detected on import of the
submission, the student is automatically informed
about this, and re-submission is invited.
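
A minimal sketch of how such a check could look is
given below. It uses the built-in compile()
function, which parses the source without running
it; the function name check_syntax and the format
of the returned message are assumptions made for
this illustration, not the code used in the system.

```python
def check_syntax(source, filename="<submission>"):
    """Return None if `source` compiles, else a short error note.

    compile() only parses the code; nothing in the submission
    is executed by this check.
    """
    try:
        compile(source, filename, "exec")
    except SyntaxError as error:
        # IndentationError is a subclass of SyntaxError, so stray
        # spaces or indentation are reported here as well.
        return "line {}: {}".format(error.lineno, error.msg)
    return None
```

A non-None result would then trigger the automatic
email that invites re-submission, rather than a
zero mark.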

4.4.2 Submissions in undeclared non-ASCII
character encoding

We noticed an increasing trend, especially among
international students, for submitting files in
8-bit character sets other than ASCII. Such files
are accepted by the Python 2 interpreter so long as
the encoding is declared in the first lines of the
file according to PEP 263 [34], but many of
the students who were using non-ASCII characters
were not declaring their encodings at all. Our
first response was to update our system to check
for this situation, and upon discovering it, to
send an automated email to the student concerned
with a suggestion that they declare their encoding
and re-submit. More recently, we have begun
recommending the use of the Spyder [35]
environment, whose default behaviour is to annotate
the encoding of the file in question in a
PEP 263-compliant manner. This has now virtually
eliminated the occurrence of character encoding
issues. For the few cases where these still arise,
the automatic suggestion email, and (if required)
personal support in scheduled laboratory and help
sessions enable the students to understand and
overcome the issue.
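
The check described above can be sketched as
follows. The regular expression is the one given
in PEP 263 for recognising an encoding declaration
on the first or second line of a file; the
function name is hypothetical, and the sketch
ignores the UTF-8 byte order mark that PEP 263
also accepts.

```python
import re

# Pattern for an encoding declaration as specified in PEP 263.
CODING_RE = re.compile(br"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def needs_encoding_declaration(raw):
    """Return True if `raw` (bytes) has non-ASCII bytes but no
    PEP 263 declaration on its first two lines."""
    has_non_ascii = any(byte > 127 for byte in bytearray(raw))
    first_lines = raw.splitlines()[:2]
    declared = any(CODING_RE.match(line) for line in first_lines)
    return has_non_ascii and not declared
```

Working on the raw bytes, rather than on decoded
text, means the check itself cannot fail on a file
whose encoding is unknown.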


4.4.3 PEP 8 style checker issues

As described in Section 3.4.3, we take advantage
of the pep8 utility [32] to assess the conformance
of the students' submissions against the style
recommendations of PEP 8.

Students find following style guidelines a lot
harder than adapting to hard syntactic and
semantic requirements of the programming language,
as they can solve the given exercises so that their
code exhibits correct functional behaviour while
not necessarily following the style guidelines. In
our experience, it is critical to help students to
adapt their own style habits to recommended
guidelines, for example through tools that flag up
non-conforming constructs immediately while
editing code. One such freely available tool for
Python is the Spyder [35] development environment,
for which PEP 8-compatibility highlighting can be
activated [36]. By encouraging all students to use
this environment - at university machines and in
installations of the software on their own
machines for which we provide recommendations
[37] - we find that they generally pick up the
PEP 8 guidelines quickly. As with so many things,
if introduced early on, they soon embrace the
approach and use it without additional effort in
the future. Consequently we penalise submissions
that are not PEP 8 compliant from the second week
onward.

One issue that arises with integrating PEP 8
guidelines into the assessment is that different
software release versions of the pep8 tool may
yield different numbers of warnings; this is partly
due to changes in the view on what represents
good coding style over time and partly due to
bugs being fixed in the pep8 tool itself. This can
result in unexpected warnings from the PEP 8-
related tests. As a practical measure, we ensured
that we are using the latest version of the pep8
checking tool, and have elected to omit those tests
that are treated differently by other recent
versions. The student body will generally report
any such deviations between the PEP 8 behaviour on
their own computer and the testing system, and
help in identifying any potential problems here.
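
One way to omit version-sensitive checks is to
filter the tool's report by check code, as in the
sketch below. It relies only on the default report
format of the pep8 tool (path:row:col: CODE text);
the two codes in VERSION_SENSITIVE are picked
purely for illustration and are not the set of
checks we actually omit.

```python
import re

# Default report line of the pep8 tool: path:row:col: CODE text
REPORT_LINE = re.compile(r"^.+?:\d+:\d+: (?P<code>[EW]\d+) ")

# Hypothetical set of checks that differ between tool versions.
VERSION_SENSITIVE = {"E402", "W503"}


def stable_warnings(report):
    """Return the report lines whose check is not version-sensitive."""
    kept = []
    for line in report.splitlines():
        match = REPORT_LINE.match(line)
        if match and match.group("code") not in VERSION_SENSITIVE:
            kept.append(line)
    return kept
```

The same effect can be obtained by passing the
codes to the tool's --ignore option; filtering the
report afterwards keeps the full output available
for later inspection.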

4.5 Integration with Moodle 

Moodle (Modular Object-Oriented Dynamic Learning
Environment) [38] is a widely-used open source
learning management system which can be used
to deliver course content and host online learning
activities. It is designed to support both teaching
and learning activities. The Indian Institute of
Technology (IIT) Mandi and IIT Madras use Moodle
to manage the courses at the institute level. When
running a course, instructors can add resources
and activities for their students to complete, e.g.
a simple page with downloadable documents or
submission of the assignments by a prescribed time
and date.


It was envisaged that integrating the automatic
feedback provision system with Moodle would
simplify the use of the automatic feedback system
for IIT instructors and students, by allowing them
to submit and retrieve feedback through the Moodle
interface that they already use routinely instead
of using email, thus replacing the incoming queue
process (Fig. 1a). Outgoing messages to
administrators are still emailed using the outgoing
email queue (Fig. 1b). The testing process queue
(Fig. 1c) is used as in the Southampton deployment
that is described in the main part of this paper.


In integrating the assessment system with the IIT
Moodle deployment, we have used the Sharable
Content Object Reference Model, SCORM, which is
a set of technical standards for e-learning software
products. The user front end is provided through
the browser-based Moodle User Interface, while
scripts at the back end make the connection to the
automatic assessment system. The results are then
fetched from the system and made visible to the
student and the instructor. Using Moodle also helps
the IIT to leverage the security that is already a
part of the SCORM protocols.


The implementation at IIT is via a Moodle plugin
designed such that, when a student submits an
assignment, the plugin collects the global file ID
of the submission and creates a copy of the file
outside the Moodle stack. The plugin then invokes a
Python script through exec(), transferring the
location of the file and the file ID to the script.
This Python script then acts as a user of the
automatic feedback and assessment system, and
directly enqueues the file for processing. The job
ID inside the automatic assessment system engine is
returned to the Moodle plugin, which maintains a
database mapping job IDs to file IDs. After the
file is processed by the automatic assessment
system, the results are saved as files that are
named after the (unique) job ID. When students
access their results through Moodle, the relevant
job IDs are retrieved from the database, allowing
the corresponding results file to be opened,
converted to HTML and published in a new page.
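
The mapping between job IDs and file IDs can be
pictured with the following sketch. It is written
in Python with sqlite3 for consistency with the
rest of this paper; the table layout and the
function names are assumptions made for
illustration and do not describe the schema used
by the plugin.

```python
import sqlite3


def open_mapping(path=":memory:"):
    """Open (and if necessary create) the job-to-file table."""
    connection = sqlite3.connect(path)
    connection.execute(
        "CREATE TABLE IF NOT EXISTS jobs "
        "(job_id TEXT PRIMARY KEY, file_id INTEGER)"
    )
    return connection


def record_job(connection, job_id, file_id):
    """Remember which submitted file a job belongs to."""
    with connection:
        connection.execute(
            "INSERT INTO jobs VALUES (?, ?)", (job_id, file_id)
        )


def jobs_for_file(connection, file_id):
    """Return the job IDs recorded for one submitted file."""
    rows = connection.execute(
        "SELECT job_id FROM jobs WHERE file_id = ? ORDER BY job_id",
        (file_id,),
    )
    return [row[0] for row in rows]
```

Because the results files are named after the job
ID, looking up the job IDs for a file is all that
is needed to locate the feedback to display.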

4.6 Testing of other languages

There should be no conceptual barriers to using
the automatic testing and feedback system for
testing of code written in other languages that
provide unit testing frameworks. In particular,
JUnit [39] for Java could be used instead of
py.test for Java programming courses. In this case,
the execution of the actual tests (and the writing
of the tests to run) would need to be done in Java,
but the remaining framework implemented in Python
could remain (mostly) unchanged, providing the
student submissions handling and receipts,
separation of testing jobs, a limited-privilege,
limited-resource runtime environment, maintaining
a database of results, and automatically emailing
students their feedback.

As part of our education programme in
computational science [40], we are interested in
testing C code that students write in our advanced
computational methods courses in which they are
introduced to C programming. Students learn in
particular how to combine C and Python code
to benefit from Python's effectiveness as a high
level language while achieving high execution
performance by implementing performance critical
sections in C.

We are exploring a set of lightweight options
towards automating the testing of the C code within
the given framework and our education setting:

1) Firstly, we compile the submitted C code
using gcc, capturing and parsing its standard
output and standard error to capture the
number of errors and warnings generated.

2) We then run the generated executable
under the same security restrictions as we use
for Python, capturing its standard output
and error, and potentially comparing them to
known-correct examples.

3) We are also using the ctypes library to make
functions compiled from students' C code
available within Python, so that they may be
tested with tests defined the same way as for
native Python code (see the corresponding
listing).
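
The first of these steps can be sketched as
follows. The sketch assumes gcc's usual diagnostic
format ("file:line:col: warning: ..."); the
function names and the choice of the -Wall flag
are illustrative assumptions rather than the
configuration of our system.

```python
import re
import subprocess

# gcc reports diagnostics as "file:line:col: warning: ..." or
# "file:line:col: error: ..." (the column may be absent).
DIAGNOSTIC = re.compile(r":\d+(?::\d+)?: (?:fatal )?(warning|error):")


def count_diagnostics(compiler_output):
    """Count warnings and errors in text captured from the compiler."""
    counts = {"warning": 0, "error": 0}
    for line in compiler_output.splitlines():
        match = DIAGNOSTIC.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def compile_submission(source, executable):
    """Compile `source` with gcc; return (exit status, counts)."""
    process = subprocess.Popen(
        ["gcc", "-Wall", "-o", executable, source],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge both streams for parsing
        universal_newlines=True,
    )
    output, _ = process.communicate()
    return process.returncode, count_diagnostics(output)
```

Keeping the parsing separate from the compiler
invocation allows the counting logic to be tested
on captured output without a compiler installed.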

The system that we built for testing Python code is
modular enough that the above can be incorporated
into the test work-flow for the courses where it is
required. We note that it is now necessary to handle
segmentation faults that may arise from calling the
student's C code: these may be treated similarly
to the cases where resource limits are exceeded in
testing Python code, causing the OS to terminate
the process; the student's marks may be updated
if required, or a re-submission invited, in line
with the course leader's chosen educational policy.
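
As a sketch of how such a termination can be
recognised, the following function runs a command
in a child process and classifies how it ended. It
relies on the documented POSIX behaviour of the
subprocess module, which reports death by signal N
as return code -N; the function name and the
returned strings are assumptions made for this
example.

```python
import signal
import subprocess
import sys


def run_and_classify(command):
    """Run `command` in a child process and classify how it ended.

    On POSIX systems, a child killed by signal N is reported by
    subprocess with the return code -N.
    """
    returncode = subprocess.call(command)
    if returncode == 0:
        return "ok"
    if returncode == -signal.SIGSEGV:
        return "segmentation fault"
    if returncode < 0:
        return "killed by signal {}".format(-returncode)
    return "exit status {}".format(returncode)
```

Running the tests in a child process is what makes
this possible: a segmentation fault in C code
loaded through ctypes terminates the process that
loaded it, so the parent must be a separate
process in order to observe and report the
failure.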

4.7 Pre-marking exams 

As well as assessing routine laboratory
assignments, the system is also used to support
exam marking. The format of the exam for our first
year introductory programming course is a 90 minute
session which the students spend at a computer, in
a restricted environment. They are given access to
the Spyder Python development environment to be
able to write and run code but have no access to
the Internet, and have to write code to answer exam
questions which follow the format experienced in
the weekly assignments. At the end of the exam, all
the students' code files are collected
electronically for assessment. We pre-test the exam
code files using the marking system with an
appropriate suite of tests, and then distribute the
automatically assigned marks, the detailed test
results and the source code to the examiners for
manual marking. This enables the examiners to save
significant amounts of time because it is
immediately apparent when students achieve full
marks and, where errors are found, the system's
output assists in swiftly locating them. It also
increases objectivity compared to leaving all the
assessment to be done by hand, possibly by a team
of markers who would each have to interpret and
apply a mark scheme to the exam code files.

The system has also been used to receive
coursework submissions for a course leader who
decided to assess the work exclusively manually.
In this case, the system was configured simply to
receive the submission, identify the user, store
the submission, and log the date and time of
submission of the coursework.

4.8 Discussion 

In this section, we discuss key aspects of the
design, use and effectiveness of the automatic
testing system to support learning of programming.

4.8.1 Key benefits of automatic testing 
A key benefit of using the automatic testing system
is to reduce the amount of repeated algorithmic
work that needs to be carried out by teaching
staff. In particular, establishing the correctness
of student solutions and providing basic feedback
on their code solutions is now virtually free (once
the testing code has been written) as it can be
done automatically.

This allowed us to very significantly increase the
number of exercises that students carry out as part
of the course, which helped the students to more
actively engage with the content and resulted in
deeper learning and greater student satisfaction.

The marking system frees teaching staff time that
would otherwise have been devoted to manual
marking, and which can now be used to repeat
material where necessary, explain concepts, discuss
elegance, cleanness, readability and effectiveness
of code, and suggest alternative or advanced
solution designs to those who are interested,
without having to increase the number of contact
hours.

Because of the more effective learning through
active self-paced exercises, we have also been able
to increase the breadth and depth of materials in
some of our courses without increasing contact time
or student time devoted to the course.

4.8.2 Quality of automatic feedback provision 

The quality of the feedback provision involves two
main aspects: (i) the timeliness, and (ii) the
usefulness, of the feedback.

The system typically provides feedback to
students within 2 to 3 minutes of their submission
(inclusive of an email round-trip time on the order
of a couple of minutes). This speed of feedback
provision allows and encourages students to
iteratively improve submissions where problems are
detected, addressing one issue at a time, and
learning from their mistakes each time.

This near-instant feedback is almost as good as
one could hope for, and is a very dramatic
improvement on the situation without the system in
place (where the provision of feedback would be
within a week of the deadline, when an academic
or demonstrator is available in the next practical
laboratory session).

The usefulness of the feedback is dependent
upon the student's ability to understand it, and
this is a skill that takes time and practice to
acquire. We elected to use the traceback output
provided by py.test in the feedback emails that are
sent to students in the case of a test failure, as
per the example listing. The traceback, combined
with our helpful comments in the test definitions,
allows a student to understand under precisely
which circumstances their code failed, and also to
understand why we are testing with that particular
set of parameters. Although interpreting the
tracebacks is not a skill that is immediately
obvious, especially to students who have never
programmed before, it is a skill that is usually
quickly acquired, and one which all competent
programmers should be well-versed in. We suggest
that it is an advantage to encourage students to
develop this ability at an early stage of their
learning. Students at Southampton are well
supported in acquiring these skills, including
through timetabled weekly laboratories and help
sessions staffed by academics and demonstrators.
Once the students master reading the output, the
usefulness of the feedback is very good: it
pinpoints exactly where the error was found, and
provides the rationale for the choice of test case
as well.
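
To illustrate what such commented test definitions
might look like, consider the following sketch for
a hypothetical exercise. The function
count_vowels stands in for a student submission,
and both it and the test are invented for this
example; they are not taken from our course
material.

```python
def count_vowels(text):
    """Stand-in for a student's submission (hypothetical exercise)."""
    return sum(1 for character in text.lower() if character in "aeiou")


def test_count_vowels():
    # The empty string is a boundary case: nothing to count.
    assert count_vowels("") == 0
    # Upper-case letters must be counted too: 'A' and 'e' count.
    assert count_vowels("Apple") == 2
    # A word without vowels checks that consonants are not counted.
    assert count_vowels("rhythm") == 0
```

When an assertion fails, the traceback shows the
source of the test up to the failing line, so the
comment directly above that line tells the student
why this particular input was chosen.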

A third aspect of the quality of feedback and
assessment is objectivity. Because all of the
submissions are tested to the same criteria, the
system also improves the objectivity of our marking
compared to having several people each interpreting
the mark scheme and applying their interpretations
to student work.

4.8.3 Flexible learning opportunities 

A further enhancement to the student experience
is that the system allows and promotes flexible
working. Feedback is available to students from
anywhere in the world (assuming they have Internet
access), at any time of day or night, rather
than being restricted to the locations and hours
at which laboratory sessions are scheduled. This
means that the most confident students are free to
work at their own pace and convenience. Those
students who wish for more guidance and support can
avail themselves of the full resources in the
timetabled sessions. All the students can repeat
training exercises multiple times, dealing with
each error as soon as it is discovered. They may
also repeat and re-submit assessed laboratory
exercises to gain additional feedback and deeper
understanding, but in line with our policies, this
does not change their recorded marks for assessed
work.

4.8.4 Large classes 

We have found that the assessment system is
invaluable as our student numbers grow between
years. Once exercises and didactic testing code are
developed, the automatic testing and feedback
provision does not require additional staff time to
process, assess and give feedback on student
submissions when student numbers grow from year to
year. Additional teaching staff in the practical
sessions are required to maintain the student-staff
ratio, but the automatic system reduces the overall
burden very significantly, and has helped us to
deliver the training in the face of an increase
from 85 to 425 students enrolled in our first-year
introduction to computing course.

The flexible learning that the system allows (see
Sect. 4.8.3) holds opportunities for more efficient
space use. In the weekly hands-on computing
laboratories, we currently provide all students
their own computer for 90 minutes in the presence
of teaching staff. With large student numbers,
depending on the local facilities, this can become
a timetabling and resource challenge.

We know from student attendance behaviour
that the first two weeks see nearly all students
attending the hands-on computing laboratories, but
that student attendance in the computing
laboratories declines significantly after week two,
as - for example - the best students will often
have completed and submitted the exercise before
the timetabled laboratory session, and some
students will only come to the laboratory session
to get help on a particular problem that they could
not solve on their own, needing 15 minutes of
attendance rather than 90. As a result, it should
be possible to 'overbook' computing laboratory
spaces, as is common in the airline industry for
example, based on the assumption that only a
fraction of the students will make regular use of
the laboratory sessions in the later weeks.
Figure 5a and its discussion show supporting data
on student laboratory attendance. We have not made
use of this yet.


4.8.5 Student satisfaction 

Student feedback on the automatic testing and
learning with it has been overall very positive.
We believe that the increased number of practical
exercises is an effective way to educate students
to become better programmers, and it is gratifying
for teaching staff to see students enjoying the
learning experience.


4.8.6 Software design 

Our system design of having multiple loosely
coupled processes that process student submissions
with clearly defined sub-tasks, and pass jobs from
one to another through file-system based queues,
has provided a robust system, which allowed us to
connect it with other tools, such as for example
the Moodle front-end for code submission in Madras
and Mandi.
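
The idea of a file-system based queue can be
sketched in a few lines. The following is an
illustration under stated assumptions rather than
our implementation: the ".job" file suffix, the
use of uuid for job IDs and the function names are
all hypothetical, and the sketch assumes a single
consumer process per queue directory.

```python
import os
import tempfile
import uuid


def enqueue(queue_dir, payload):
    """Add a job to the queue directory and return its job ID.

    The job is written under a temporary name and then renamed,
    so a consumer never sees a partially written file.
    """
    job_id = uuid.uuid4().hex
    handle, temporary = tempfile.mkstemp(dir=queue_dir, prefix=".tmp-")
    with os.fdopen(handle, "w") as stream:
        stream.write(payload)
    os.rename(temporary, os.path.join(queue_dir, job_id + ".job"))
    return job_id


def dequeue(queue_dir):
    """Remove and return (job_id, payload) of the oldest job, or None."""
    jobs = [name for name in os.listdir(queue_dir)
            if name.endswith(".job")]
    if not jobs:
        return None

    def modified(name):
        return os.path.getmtime(os.path.join(queue_dir, name))

    oldest = min(jobs, key=modified)
    path = os.path.join(queue_dir, oldest)
    with open(path) as stream:
        payload = stream.read()
    os.remove(path)
    return oldest[:-len(".job")], payload
```

Because producer and consumer share nothing but a
directory, either side can be replaced (for
example by a Moodle front-end) without changes to
the other.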


5 Summary 

We have reported on the automatic marking and
feedback system that we developed and deployed
for teaching programming to large classes of
undergraduates. We provided statistics from one
year of use of our live system, illustrating that
the students took good advantage of the "iterative
refinement" model that the system was conceived to
support, and that they also benefited from
increased flexibility and choice regarding when
they work on, and submit, assignments. The system
has also helped reduce staff time spent on
administration and manual marking duties, so that
the available time can be spent more effectively
supporting those students who need this.
Attempting to address some of the shortcomings of
other literature in the field as perceived by a
recent review article, we provided copious
technical details of our implementation. With
increasing class sizes forecast for the future, we
foresee this system continuing to provide us value
and economy whilst giving students the benefit of
prompt, efficient and impartial feedback. We also
envisage further refining the system's capabilities
at assessing submissions in languages other than
Python.